Statistics 2 - Multiple Linear Regression, LASSO, and logistic regression

Learning objectives and motivation

This tutorial gives a brief overview of several techniques for explaining data with one or more characteristics (covariates). It looks at three cases: multiple linear regression, the LASSO, and logistic regression.

Multiple linear regression and the LASSO

This section revises univariate regression and gradually builds up to multivariate regression and the LASSO estimator.

Univariate regression

In the univariate case, we are interested in explaining a response \(Y\) with one variable \(X\) in a linear relationship: \[Y = a X + b.\] In this model, a single covariate \(X\) is used to explain the response \(Y\). The coefficients \(a\) and \(b\) are unknown and are thus fitted to the data. As an example, consider the classic “mtcars” dataset in R:

head(mtcars)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

We suspect a linear relationship between the displacement (the volume of the pistons inside the engine; variable “disp”) and the weight “wt”. A plot of both shows a promising trend between the two variables:

plot(x=mtcars$disp, y=mtcars$wt, main="Weight as a function of engine displacement")

We fit a linear model:

linearMod <- lm(wt ~ disp, data=mtcars)
summary(linearMod)
## 
## Call:
## lm(formula = wt ~ disp, data = mtcars)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.89044 -0.29775 -0.00684  0.33428  0.66525 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 1.5998146  0.1729964   9.248 2.74e-10 ***
## disp        0.0070103  0.0006629  10.576 1.22e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4574 on 30 degrees of freedom
## Multiple R-squared:  0.7885, Adjusted R-squared:  0.7815 
## F-statistic: 111.8 on 1 and 30 DF,  p-value: 1.222e-11

The \(R\) summary gives us an estimate of the coefficient of the displacement variable, together with a standard error, and the same for the intercept. This fixes the model at \(Y = 1.60 + 0.007 X\). Before using this model, it is crucial to check whether the fitted coefficients are significant: with each coefficient we can associate the null hypothesis “\(H_0\): The coefficient is zero” – we want to reject this hypothesis in favour of the alternative, concluding that the coefficient is significantly non-zero (non-trivial) in the model. In the above, both coefficients are strongly significant.

The fitted line is visualised as follows:

plot(x=mtcars$disp, y=mtcars$wt, main="Weight as a function of engine displacement")
x <- seq(0,600,length.out=100)
lines(x,1.60+0.007*x,col="red")

A simple linear regression can also be done easily by hand. For this we observe that by definition, the coefficients \(a\) and \(b\) in the model \(Y=aX+b\) have the property that they minimize \(\sum_{i=1}^n (y_i-ax_i-b)^2\) (the squared difference between regression line and observed \(y\) datapoints), where \((x_1,y_1),\ldots,(x_n,y_n)\) are the datapoints we are given. This so-called least squares minimisation can be solved analytically, leading to the estimates \[\begin{align*} \hat{a} &= \frac{\sum_{i=1}^n (x_i-\bar{x})(y_i-\bar{y})}{\sum_{i=1}^n (x_i-\bar{x})^2},\\ \hat{b} &= \bar{y}-\hat{a}\bar{x}, \end{align*}\] where \(\bar{x}\) and \(\bar{y}\) are the means of all \(x\) datapoints and all \(y\) datapoints, respectively.
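These closed-form estimates are straightforward to implement directly. A minimal sketch in Python with numpy (illustrative only: the data below are made up so the block is self-contained, and `simple_ols` is our own helper, not a library function):

```python
import numpy as np

def simple_ols(x, y):
    # Closed-form least-squares estimates for the model y = a*x + b
    x_bar, y_bar = x.mean(), y.mean()
    a_hat = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
    b_hat = y_bar - a_hat * x_bar
    return a_hat, b_hat

# Made-up data lying exactly on the line y = 2x + 1
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 1.0
a_hat, b_hat = simple_ols(x, y)  # recovers a = 2, b = 1
```

On noisy data the same formulas give the usual least-squares line, identical to what `lm` computes in the one-covariate case.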

Multiple linear regression

Often, several variables might explain the data. For example, in the “mtcars” dataset, we might suspect that the fuel consumption “mpg” depends on the cylinder count “cyl”, displacement “disp”, horsepower “hp”, and weight “wt”. We thus fit a linear model of the type \[mpg = a*cyl + b*disp + c*hp + d*wt + e,\] where each of “mpg”, “cyl”, “disp”, “hp” and “wt” is a vector of observations. In matrix notation, we can write our model in the form \[Y = Xa + b,\] where \(X\) is a matrix whose four columns contain the data of the four covariates and \(a\) is the vector of the four coefficients. A fit yields

lfit <- lm(mpg ~ cyl + disp + hp + wt, data=mtcars)
summary(lfit)
## 
## Call:
## lm(formula = mpg ~ cyl + disp + hp + wt, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.0562 -1.4636 -0.4281  1.2854  5.8269 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 40.82854    2.75747  14.807 1.76e-14 ***
## cyl         -1.29332    0.65588  -1.972 0.058947 .  
## disp         0.01160    0.01173   0.989 0.331386    
## hp          -0.02054    0.01215  -1.691 0.102379    
## wt          -3.85390    1.01547  -3.795 0.000759 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.513 on 27 degrees of freedom
## Multiple R-squared:  0.8486, Adjusted R-squared:  0.8262 
## F-statistic: 37.84 on 4 and 27 DF,  p-value: 1.061e-10

Here, “cyl” is only weakly significant, the weight “wt” is highly significant, and so is the intercept term. We expect those three terms to be included in the final model. The other variables are not significant and might be omitted from the final model.

A standard problem in such scenarios is the following: when is a coefficient small enough that we can safely neglect the corresponding variable in the model? This problem is known as variable selection. One way of addressing it is through hypothesis testing (as done with the \(^{*}\), \(^{**}\), \(^{***}\) notation above). Another elegant way is the LASSO estimator introduced in the next section.

To end this subsection, the following 3D plot shows the relation of the variable “mpg” on the two significant variables “cyl” and “wt”:

evaluate <- function(x,y,f) {
  z <- matrix(nrow=length(x),ncol=length(y),0);
  for(i in 1:length(x)) {
    for(j in 1:length(y)) {
      z[i,j] <- f(x[i],y[j]);
    }
  }
  return(z);
}
require(plot3D)
## Loading required package: plot3D
## Warning: package 'plot3D' was built under R version 3.6.1
x <- mtcars$cyl
y <- mtcars$wt
persp3D(x,y,evaluate(x,y,function(x,y) -1.29*x-3.85*y+40.83),theta=50,phi=25,col="darkred",ticktype="detailed",shade=0.75,ltheta=10,alpha=0.5,xlim=c(2,10),ylim=c(1,4),zlim=c(10,35),xlab="cyl",ylab="wt",zlab="mpg")
points3D(x,y,mtcars$mpg,col="blue",pch=3,cex=1,add=T)

The LASSO

Main idea of LASSO: How can we do regression and variable selection at the same time?

The least absolute shrinkage and selection operator (LASSO) is a variation of the commonly used ordinary least squares (OLS) estimator \[\hat{\beta} = \arg\min_\beta \left( \frac{1}{n} \Vert Y-X\beta \Vert_2^2 \right),\] where \(n\) is the length of the response vector \(Y \in \mathbb{R}^n\), \(X \in \mathbb{R}^{n \times p}\) is the covariate matrix as introduced above, and \(\beta \in \mathbb{R}^p\) is the linear coefficient vector to be fitted. Precisely this estimator has been used to perform the univariate and multivariate fits above.

To address the problem of coefficients with small but non-zero values (which might or might not belong to the model), a penalty term is added to the OLS estimator: \[\hat{\beta}(\lambda) = \arg\min_\beta \left( \frac{1}{n} \Vert Y-X\beta \Vert_2^2 + \lambda \Vert \beta \Vert_1 \right).\] The penalty \(\Vert \beta \Vert_1=\sum_{i=1}^p |\beta_i|\) is weighted with the parameter \(\lambda>0\) (to be determined). The effect of the penalty term is as follows: if \(\lambda=0\), we obtain the OLS problem. The larger \(\lambda\) is, the more we penalise non-zero coefficients, thus incentivising the fit to contain few non-zero coefficients. The interesting property of the LASSO is that the penalty makes some coefficients exactly zero during the fitting process, and thus it performs variable selection and regression (on the non-zero variables) at the same time.
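To see the shrinkage-to-exactly-zero behaviour in action, here is a minimal coordinate-descent sketch of the LASSO objective above, written in Python with numpy (purely illustrative: the synthetic data and the choice \(\lambda = 1\) are assumptions, and glmnet's real implementation is far more refined):

```python
import numpy as np

def soft_threshold(z, t):
    # S(z, t) = sign(z) * max(|z| - t, 0): the operator that sets small values to exactly zero
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    # Cyclic coordinate descent on (1/n)||y - X beta||_2^2 + lam * ||beta||_1
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(n_iter):
        for j in range(p):
            r_j = y - X @ beta + X[:, j] * beta[j]      # residual ignoring coordinate j
            rho = X[:, j] @ r_j / n                     # correlation of column j with that residual
            z = X[:, j] @ X[:, j] / n
            beta[j] = soft_threshold(rho, lam / 2) / z  # exact coordinate-wise minimiser
    return beta

# Synthetic data: only the first two of five covariates actually matter
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1]
beta = lasso_cd(X, y, lam=1.0)
```

With this \(\lambda\), the three irrelevant coefficients come out exactly zero, while the two true coefficients are shrunk towards zero but retained: variable selection and regression in one step.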

Though originally defined for least squares, LASSO regularisation is easily extended to a wide variety of statistical models including generalised linear models, generalised estimating equations, and proportional hazards models.

The package “glmnet” allows us to apply the LASSO (use the parameter “alpha=1” in the function “glmnet”). The function returns a sequence of fitted models (fitted coefficient vectors) for various values of \(\lambda\):

require(glmnet)
## Loading required package: glmnet
## Warning: package 'glmnet' was built under R version 3.6.1
## Loading required package: Matrix
## Loading required package: foreach
## Warning: package 'foreach' was built under R version 3.6.1
## Loaded glmnet 2.0-18
c <- glmnet(as.matrix(mtcars[-1]), mtcars[,1], standardize=T, alpha=1)
plot(c, xvar="lambda", xlab="log lambda", ylab="coefficient value", label=T)

The plot shows, reading from right to left, that as the value of \(\lambda\) decreases towards zero, more and more variables enter the model (their coefficients become non-zero): thus via \(\lambda\) we can control the sparsity of the model and select a model of suitable sparsity. Variable \(5\) can be regarded as most likely belonging to the model, as it enters the model first. For a final fit we can use a value of \(\lambda\) at which no further variable enters the model and the coefficients of the existing variables have stabilised, e.g. \(\lambda=\exp(-4)\).

coef(c,s=exp(-4))
## 11 x 1 sparse Matrix of class "dgCMatrix"
##                        1
## (Intercept) 13.729626380
## cyl         -0.061825777
## disp         0.006714125
## hp          -0.017236445
## drat         0.831768528
## wt          -3.178127693
## qsec         0.708151007
## vs           0.233656522
## am           2.434565826
## gear         0.622354930
## carb        -0.380325766

The above values are the final fit (note that the argument for the function “coef” is called “s” instead of “lambda”). A better way of obtaining \(\lambda\) is the so-called “cross-validation”: see the references in the appendix.
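The idea of K-fold cross-validation can be sketched in a few lines: split the data into K folds, fit on all but one fold for each candidate \(\lambda\), and pick the \(\lambda\) with the smallest held-out error. In the Python sketch below, a closed-form ridge estimator stands in for the LASSO fitter purely to keep the block self-contained (an assumption of this sketch; the same loop works with any \(\lambda\)-indexed estimator, including the LASSO):

```python
import numpy as np

def ridge_fit(X, y, lam):
    # Stand-in penalised estimator with a closed form (NOT the LASSO itself)
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

def cv_error(X, y, lam, K=5):
    # Mean held-out squared error over K folds
    n = len(y)
    folds = np.array_split(np.arange(n), K)
    errs = []
    for k in range(K):
        test = folds[k]
        train = np.setdiff1d(np.arange(n), test)
        beta = ridge_fit(X[train], y[train], lam)
        errs.append(np.mean((y[test] - X[test] @ beta) ** 2))
    return float(np.mean(errs))

# Pick the lambda with the lowest cross-validated error on synthetic data
rng = np.random.default_rng(1)
X = rng.standard_normal((100, 5))
y = X[:, 0] + 0.1 * rng.standard_normal(100)
lambdas = [0.01, 0.1, 1.0, 10.0, 100.0]
best_lam = min(lambdas, key=lambda l: cv_error(X, y, l))
```

The held-out error guards against overfitting: a \(\lambda\) that is too small fits noise, one that is too large shrinks away real signal.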

The far left-hand side of the “log lambda” plot above shows another crucial insight. The more variables we include, the better the fit becomes; however, we become prone to overfitting the model: not all variables are actually meaningful (e.g. of several correlated variables we would only need to include one, but we include them all), making the final result harder and harder to interpret. This is a classical tradeoff:

  fewer variables                                  | more variables
  model captures fewer characteristics of the data | overfitting
  easy to interpret                                | hard to interpret

Logistic regression

The regression considered in the previous two sections assumes a continuous response \(Y \in \mathbb{R}^n\). Logistic regression is applicable if the response is binary, such as pass/fail or win/lose outcomes. Logistic regression can also be generalised to binomial, ordinal, or multinomial scenarios.

Logistic regression analyses a model with binary outcomes by defining \(p=P(Y=1)\), the probability of obtaining a success (pass, win, etc.), and by fitting the transformation \[l = \log_b \frac{p}{1-p} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \ldots.\] As an example, we consider the following dataset from Wikipedia:

print(dat)
##    Hours Pass
## 1   0.50    0
## 2   0.75    0
## 3   1.00    0
## 4   1.25    0
## 5   1.50    0
## 6   1.75    0
## 7   1.75    1
## 8   2.00    0
## 9   2.25    1
## 10  2.50    0
## 11  2.75    1
## 12  3.00    0
## 13  3.25    1
## 14  3.50    0
## 15  4.00    1
## 16  4.25    1
## 17  4.50    1
## 18  4.75    1
## 19  5.00    1
## 20  5.50    1

The table shows the number of hours students work for an exam (first column) and the result of the examination (1=pass, 0=fail; second column). First, let’s plot the data:

plot(dat$Hours, dat$Pass, xlab="hours worked for the exam", ylab="pass=1, fail=0")
x <- seq(0,6,length.out=100)
lines(x, -0.1539 + 0.2346*x, col="blue")
lines(x, exp(-4.1+1.5*x)/(1+exp(-4.1+1.5*x)), col="red")

The figure also displays two possible fit curves: the best linear fit (blue) and the logistic fit (red).

The following shows how to obtain the red regression curve. We fit a linear model where, according to the logistic regression model, we treat the response as the continuous quantity \(\log_b \frac{p}{1-p}\): in other words, we assume that the zeros and ones we observe (fails and passes) are not really zeros and ones, but realisations of a random outcome with underlying success probability \(p\), whose log-odds \(\log_b \frac{p}{1-p}\) depend linearly on the covariates.

In \(R\) we compute the following:

logistic <- glm(Pass ~ Hours, data = dat, family="binomial")
summary(logistic)
## 
## Call:
## glm(formula = Pass ~ Hours, family = "binomial", data = dat)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -1.70557  -0.57357  -0.04654   0.45470   1.82008  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)  
## (Intercept)  -4.0777     1.7610  -2.316   0.0206 *
## Hours         1.5046     0.6287   2.393   0.0167 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 27.726  on 19  degrees of freedom
## Residual deviance: 16.060  on 18  degrees of freedom
## AIC: 20.06
## 
## Number of Fisher Scoring iterations: 5

(Note the flag “family=binomial” — this is because the binomial distribution encompasses the Bernoulli distribution with two outcomes as a special case, hence the \(R\) command is more general.)

The \(R\) summary displays the fit of the model \(\log \frac{p}{1-p} = \beta_0 + \beta_1 h\) to the data (since we only have one covariate), where \(h\) denotes the number of hours worked for the exam and \(b=e\) is a fixed choice resulting in the natural logarithm (any other base works as well). From the fit we see that the model \(\log \frac{p}{1-p} = -4.1 + 1.5 h\) best describes the data. The red curve in the above plot was computed using these coefficients \(-4.1\) and \(1.5\).

Finally, we see that for any number of hours \(h\) worked, the log-odds of passing the exam are given directly by the fitted linear model, that is \[\log \frac{p}{1-p} = -4.1 + 1.5h,\] and thus the odds of passing the exam are (after applying “exp” to both sides): \[\frac{p}{1-p} = \exp(-4.1+1.5h).\] Solving for the probability \(p\) of passing the exam yields \[p = \frac{1}{1 + \exp \left[ -(\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \ldots) \right]} = \frac{1}{1 + \exp \left[ -(-4.1+1.5h) \right]},\] thus for e.g. \(h=2\) hours of exam preparation, the probability of passing is approximately \(0.25\).
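For completeness, the glm fit above can also be reproduced by hand with Newton's method (iteratively reweighted least squares). A sketch in Python with numpy, using the hours/pass data from the table above:

```python
import numpy as np

hours = np.array([0.50, 0.75, 1.00, 1.25, 1.50, 1.75, 1.75, 2.00, 2.25, 2.50,
                  2.75, 3.00, 3.25, 3.50, 4.00, 4.25, 4.50, 4.75, 5.00, 5.50])
passed = np.array([0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1])

X = np.column_stack([np.ones_like(hours), hours])  # design matrix with intercept column
beta = np.zeros(2)
for _ in range(25):                                # Newton / IRLS iterations
    p = 1.0 / (1.0 + np.exp(-X @ beta))            # fitted probabilities
    W = p * (1.0 - p)                              # Bernoulli variances (IRLS weights)
    beta += np.linalg.solve(X.T @ (W[:, None] * X), X.T @ (passed - p))

# probability of passing after h = 2 hours of preparation
p_two_hours = 1.0 / (1.0 + np.exp(-(beta[0] + beta[1] * 2.0)))
```

After a handful of iterations, `beta` matches the glm estimates \((-4.08, 1.50)\) from the summary above to several decimals.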

Further Reading

References on simple linear regression, the LASSO, and the glmnet library used to fit LASSO models:

Simple linear regression by hand

Feature Selection using LASSO

Cross validation for LASSO

Glmnet

ANOVA

Statistics 3 - Cluster Analysis

This notebook was originally written in Jupyter notebooks.

It is rendered here using Python in R Markdown.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

#https://github.com/rstudio/rstudio/issues/4182 #to do with matplot

import os
# Set the path to be 'path-to_your_anaconda\anaconda3\Library\plugins\platforms'
os.environ['QT_QPA_PLATFORM_PLUGIN_PATH'] = r'C:\Anaconda3\Library\plugins\platforms'  # raw string: backslashes are literal

Learning Objectives

  1. Understand the concept of data standardisation and recognise different approaches.
  2. Understand the purpose of cluster analysis and know the algorithms for k-means and hierarchical clustering.
  3. Understand how SSE and silhouette scores can be used to assess the quality of a clustering.

Basic Concept

include_graphics(".\\img\\statistics3\\Slide3.png")

df = pd.read_csv('.\\data\\tremor_data.csv')
df.head()
##    Time (mins)  Distance (m)  Size (Moment Magnitude)
## 0        14963         20518                      3.0
## 1        15229         35321                      1.8
## 2        17442         38120                      1.9
## 3        17902         51661                      2.0
## 4        18393         36197                      1.7

Data

summary_stats = df.describe()
np.round(summary_stats,decimals=1)
##        Time (mins)  Distance (m)  Size (Moment Magnitude)
## count        500.0         500.0                    500.0
## mean       30789.3       34287.1                      2.2
## std         5607.8       14433.0                      0.3
## min        14963.0         873.0                      1.2
## 25%        24949.2       23034.0                      1.9
## 50%        33131.0       32498.0                      2.2
## 75%        34561.5       48837.5                      2.4
## max        44697.0       73074.0                      3.2

Data

df.plot.scatter(x='Time (mins)',y='Distance (m)')
plt.show()

Data Standardisation

Why standardise data?

  • Data in different units cannot be meaningfully compared without scaling.
  • How you scale your data can be very important.
  • Without standardisation, clustering may be dominated by the variable with the greatest range.

Methods of Standardisation

For a practical application of min-max rescaling and IDR standardisation, see: https://www.ons.gov.uk/methodology/geography/geographicalproducts/areaclassifications/2011areaclassifications/methodologyandvariables
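For concreteness, here are minimal numpy sketches of min-max rescaling, z-score standardisation, and one common inter-decile-range form (the function names are our own, and the exact IDR variant used by the ONS may differ):

```python
import numpy as np

def min_max(x):
    # Rescale to the interval [0, 1]
    return (x - x.min()) / (x.max() - x.min())

def z_score(x):
    # Centre on 0, scale to unit standard deviation (ddof=1 matches pandas' .std())
    return (x - x.mean()) / x.std(ddof=1)

def idr(x):
    # One common inter-decile-range form: centre on the median,
    # scale by the 90th minus the 10th percentile
    return (x - np.median(x)) / (np.percentile(x, 90) - np.percentile(x, 10))

# A few of the tremor times from the table above, for illustration
x = np.array([14963.0, 15229.0, 17442.0, 17902.0, 18393.0])
```

All three put variables with different units on comparable scales; which one is appropriate depends on the distribution of the data and on how sensitive you want the result to be to outliers.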

df['Time (mins)'].hist()
plt.show()

df['Distance (m)'].hist()
plt.show()

We will choose z-score standardisation. Create new columns containing standardised versions of each of the variables and add these to the dataframe.

Save this dataframe as ‘new_tremor_data.csv’.

# Get means and standard deviations from the .describe() dataframe

stdevs = summary_stats.loc['std']
means = summary_stats.loc['mean']

# Create and add a standardised column for each column to the original dataframe :

for col in df.columns:
    df[col + '_standard'] = (df[col] - means[col])/ stdevs[col]

# Save:

df.to_csv('.\\data\\new_tremor_data.csv')
df.head()
##    Time (mins)  ...  Size (Moment Magnitude)_standard
## 0        14963  ...                          2.468986
## 1        15229  ...                         -1.091196
## 2        17442  ...                         -0.794514
## 3        17902  ...                         -0.497832
## 4        18393  ...                         -1.387878
## 
## [5 rows x 6 columns]
df.plot.scatter(x='Time (mins)_standard',y='Distance (m)_standard')
plt.show()

Clustering Algorithms

K-Means Clustering
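Before calling the library, it is worth recalling what k-means actually does: starting from initial centroids, it alternates between assigning each point to its nearest centroid and moving each centroid to the mean of its assigned points. A from-scratch sketch in numpy (illustrative only; we use sklearn's far more robust implementation below, and the farthest-first initialisation here is a simple stand-in for k-means++):

```python
import numpy as np

def kmeans(X, k, n_iter=100):
    # Farthest-first initialisation: start at the first point, then repeatedly
    # pick the point farthest from all centroids chosen so far
    centroids = [X[0]]
    for _ in range(1, k):
        d = np.min([np.linalg.norm(X - c, axis=1) for c in centroids], axis=0)
        centroids.append(X[d.argmax()])
    centroids = np.array(centroids)
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each centroid to the mean of its points
        # (this sketch assumes no cluster goes empty)
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):
            break  # converged
        centroids = new_centroids
    return labels, centroids

# Two well-separated blobs: the algorithm recovers them exactly
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.1, (20, 2)), rng.normal(5, 0.1, (20, 2))])
labels, centroids = kmeans(X, k=2)
```

sklearn's `n_init=20` below reruns this whole procedure 20 times from different starting centroids and keeps the best result, since k-means only finds a local optimum.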

Let’s apply k-means to the tremors data, using all three dimensions: time, distance and magnitude:

# Import the following packages:

import numpy as np
import sklearn.cluster as sklc

# Take the relevant standardised data from the dataframe we created earlier:

clustering_data = df.loc[:,['Time (mins)_standard','Distance (m)_standard','Size (Moment Magnitude)_standard']]

# Choose number of clusters, e.g.:

k = 6

# Perform k-means 20 times, returning the best results as an output object:

kmeans_output = sklc.KMeans(n_clusters=k, n_init=20).fit(clustering_data)

# Ask the output object for the information you need:

q6_cluster_ids = kmeans_output.labels_
q6_cluster_cns = kmeans_output.cluster_centers_

# Add the cluster ids to the dataframe:

df['cluster_ids_kmeans'] = q6_cluster_ids

# Inspect the results:
print("Cluster Centroids:")
## Cluster Centroids:
print(np.round(q6_cluster_cns,decimals=1))
## [[-1.4 -0.7  0.6]
##  [ 0.6 -0.9 -1.1]
##  [ 0.6  1.1  1.1]
##  [-1.4  1.  -0.8]
##  [ 0.7  0.7 -0.5]
##  [ 0.5 -0.7  0.4]]
df.head()
##    Time (mins)  ...  cluster_ids_kmeans
## 0        14963  ...                   0
## 1        15229  ...                   3
## 2        17442  ...                   3
## 3        17902  ...                   3
## 4        18393  ...                   3
## 
## [5 rows x 7 columns]

To understand these clusters, we need to visualise them:

# Let's visualise these clusters in 2D plots, taking two variables at a time.
# The plots show three views of the 3-dimensional data cloud.

colours = ['r','b','g','m','k','c','y'] * 12
icons   = ['o','x','s','.','v','^','<','>','*','+','D','d'] * 7


def plot_clusters_for_two_variables(data,data_clusters,var1,var2,fignum):

    plt.figure(fignum,figsize = (7,7))

    for i in range(k):
        
        plt.figure(fignum)
        data_this_cluster = data[data_clusters==i]
        
        plt.plot(data_this_cluster[var1],data_this_cluster[var2],colours[i] + icons[i])

        plt.gca().set_aspect('equal')
        plt.gca().set_xlim([-3,3])
        plt.gca().set_ylim([-3,3])
        
        plt.gca().set_xlabel(var1)
        plt.gca().set_ylabel(var2)

    plt.savefig('.\\img\\statistics3\\clusters_' + str(fignum) +'.png')

    plt.show()
    
cluster_ids = df['cluster_ids_kmeans']

plot_clusters_for_two_variables(clustering_data,cluster_ids,'Time (mins)_standard','Distance (m)_standard',10)

plot_clusters_for_two_variables(clustering_data,cluster_ids,'Time (mins)_standard','Size (Moment Magnitude)_standard',11)

plot_clusters_for_two_variables(clustering_data,cluster_ids,'Distance (m)_standard','Size (Moment Magnitude)_standard',12)

# Some 3D Plotting would be pretty handy here:

from mpl_toolkits.mplot3d import Axes3D

def plot_3D_clusters(data,data_clusters,var1,var2,var3,fignum):

    fig = plt.figure(fignum,figsize = (12,12))
    ax = fig.add_subplot(111, projection='3d')

    for i in range(k):
        
        plt.figure(fignum)
        data_this_cluster = data[data_clusters==i]
        
        ax.scatter(data_this_cluster[var1],data_this_cluster[var2],data_this_cluster[var3],c=colours[i],marker=icons[i])

        ax.set_aspect('equal')
        ax.set_xlim([-3,3])
        ax.set_ylim([-3,3])
        ax.set_zlim([-3,3])
        
        plt.gca().set_xlabel(var1)
        plt.gca().set_ylabel(var2)
        plt.gca().set_zlabel(var3)

    plt.savefig('.\\img\\statistics3\\3Dclusters_' + str(fignum) +'.png')

    plt.show()

cluster_ids = df['cluster_ids_kmeans']
plot_3D_clusters(clustering_data,cluster_ids,'Time (mins)_standard','Distance (m)_standard','Size (Moment Magnitude)_standard',12121)

Hierarchical Agglomerative Clustering
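Hierarchical agglomerative clustering starts with every point in its own cluster and repeatedly merges the two closest clusters; under single linkage, the distance between two clusters is the distance between their closest pair of points. A naive \(O(n^3)\) sketch in numpy (illustrative; scipy's implementation used below is far more efficient):

```python
import numpy as np

def single_linkage(X, num_clusters):
    # Start with singleton clusters and merge until num_clusters remain
    clusters = [[i] for i in range(len(X))]
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)  # pairwise distances
    while len(clusters) > num_clusters:
        best = (np.inf, None, None)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single linkage: distance between closest members of the two clusters
                d = D[np.ix_(clusters[a], clusters[b])].min()
                if d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] += clusters[b]  # merge the closest pair of clusters
        del clusters[b]
    labels = np.empty(len(X), dtype=int)
    for cid, members in enumerate(clusters):
        labels[members] = cid
    return labels

# Two tight groups far apart: single linkage separates them
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
              [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]])
labels = single_linkage(X, 2)
```

Other linkage rules (complete, average, Ward) differ only in how the inter-cluster distance `d` is defined.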

# Import the following packages:
import scipy.cluster.hierarchy as spch
num_clusters = 6
cluster_allocations_hmaxclust = spch.fclusterdata(clustering_data, num_clusters, metric='euclidean', method='single',criterion='maxclust')
# A quick 3D visualisation of these clusters:
cluster_ids = cluster_allocations_hmaxclust
plot_3D_clusters(clustering_data,cluster_ids,'Time (mins)_standard','Distance (m)_standard','Size (Moment Magnitude)_standard',12121)

# Plotting a Dendrogram:

Z = spch.linkage(clustering_data, method='single', metric='euclidean')
plt.figure(30,figsize = (10,10))
#semicols: https://stackoverflow.com/questions/28942969/is-there-a-python-equivalent-to-rs-invisible-function #prints extra stuff here.

# Use dummy variable '_' here to prevent output of functions being printed. We are not interested in the output here.
_ = spch.dendrogram(Z,p=20,truncate_mode='lastp',color_threshold=2,orientation='right');
_ = plt.gca().set_xlim([0.3,1.3])
_ = plt.gca().set_xticks(np.linspace(0.3,1.6,14))
plt.gca().set_xlabel('r')
## Text(0.5, 0, 'r')
plt.gca().set_ylabel('Cluster IDs')
## Text(0, 0.5, 'Cluster IDs')
plt.savefig('.\\img\\statistics3\\tremor_hagglom_dendrogram.png')
plt.show()

Measuring Clustering Quality

Method 1 - SSE

print('SSE for previous (3 variable) k-means clustering:', np.round(kmeans_output.inertia_,decimals=1))
## SSE for previous (3 variable) k-means clustering: 394.6
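The `inertia_` attribute is simply the sum of squared Euclidean distances from each point to its assigned centroid, which is easy to recompute by hand. A self-contained sketch with tiny made-up data:

```python
import numpy as np

def sse(X, labels, centroids):
    # Sum of squared Euclidean distances to each point's assigned centroid
    return float(np.sum((X - centroids[labels]) ** 2))

# Tiny worked example: two points per centroid, each offset by 1 in one axis,
# so each point contributes 1.0 and the total SSE is 4.0
X = np.array([[0.0, 1.0], [0.0, -1.0], [10.0, 1.0], [10.0, -1.0]])
labels = np.array([0, 0, 1, 1])
centroids = np.array([[0.0, 0.0], [10.0, 0.0]])
```

Applied to `clustering_data` with the fitted labels and centroids, this function would reproduce the `kmeans_output.inertia_` value printed above.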

Method 2 - Silhouette Analysis

# Import the following package:
import sklearn.metrics as sklm

Shown above is a silhouette diagram for the two-dimensional data.
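For reference, the silhouette value of a single point is \(s = (b - a)/\max(a, b)\), where \(a\) is its mean distance to the other points in its own cluster and \(b\) is the smallest mean distance to the points of any other cluster. A minimal sketch of the overall score (assuming every cluster has at least two points; sklearn's `silhouette_score` handles the general case):

```python
import numpy as np

def silhouette(X, labels):
    # Mean over all points of s = (b - a) / max(a, b)
    n = len(X)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)  # pairwise distances
    scores = []
    for i in range(n):
        own = (labels == labels[i]) & (np.arange(n) != i)
        a = D[i, own].mean()                                   # mean intra-cluster distance
        b = min(D[i, labels == c].mean()                       # nearest other cluster
                for c in set(labels.tolist()) if c != labels[i])
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))

# Two tight, well-separated pairs: the score is close to its maximum of 1
X = np.array([[0.0, 0.0], [0.0, 0.1], [5.0, 0.0], [5.0, 0.1]])
labels = np.array([0, 0, 1, 1])
score = silhouette(X, labels)
```

Scores near 1 indicate compact, well-separated clusters; scores near 0 indicate overlapping clusters, and negative scores suggest points assigned to the wrong cluster.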

Let’s produce a silhouette diagram for the full three-dimensional tremors data:

# Set range of k values:

k_range = range(1,21)

# Create (almost) empty lists to store the output:
# (Cluster Allocations, SSE, Centroids, Silhouette scores)

# (Almost) empty, because we start with one piece of missing data in position zero.
# This is the data for zero clusters (i.e. an impossible case), and it means that the...
# ... indices of the elements will match the number of clusters that they refer to:

cluster_ids_series = [np.nan]
cluster_sse_series = [np.nan]
cluster_cns_series = [np.nan]
cluster_shs_series = [np.nan]

# Loop through k values:

for k in k_range:
    
    # Perform k means as before:
    
    km_output = sklc.KMeans(n_clusters = k, n_init = 20).fit(clustering_data)

    # Add the results to our lists:
    
    cluster_ids_series.append(km_output.labels_)
    cluster_sse_series.append(km_output.inertia_)
    cluster_cns_series.append(km_output.cluster_centers_)
    
    # Only add silhouettes if there is no error, since silhouettes cannot be calculated when, e.g...
    # ... there is only one cluster or where some points are in their own clusters:
    
    try:
        cluster_shs_series.append(sklm.silhouette_score(clustering_data,km_output.labels_))
    except ValueError:
        cluster_shs_series.append(0)

# Build a dictionary of the quantities we need to report on and convert to a dataframe.
# Also drop rows with missing data:
report_dict = {'SSE':cluster_sse_series,'Silhouette Score':cluster_shs_series}
report_df   = pd.DataFrame(report_dict,index=range(21))
report_df   = report_df.dropna(how='any')

# Find and report the optimal value of k:

optimal_k_by_silhouette_score = report_df['Silhouette Score'].idxmax()
optimal_silhouette_score      = report_df.loc[optimal_k_by_silhouette_score,'Silhouette Score']

print('The optimal number of clusters, as determined by silhouette analysis, is ' + str(optimal_k_by_silhouette_score) + ".")
## The optimal number of clusters, as determined by silhouette analysis, is 4.
print('The silhouette score for ' + str(optimal_k_by_silhouette_score) + " clusters is " + str(optimal_silhouette_score) + ".")
## The silhouette score for 4 clusters is 0.38042406738305873.
report_df
##             SSE  Silhouette Score
## 1   1497.000000          0.000000
## 2   1074.916372          0.335446
## 3    752.233210          0.349713
## 4    570.803247          0.380424
## 5    473.890034          0.338219
## 6    394.578336          0.357749
## 7    343.574863          0.355323
## 8    312.338837          0.342047
## 9    284.142239          0.348692
## 10   264.162639          0.340644
## 11   246.300908          0.329744
## 12   232.837426          0.326050
## 13   219.781091          0.333284
## 14   207.658150          0.328443
## 15   198.122470          0.329828
## 16   187.799038          0.336409
## 17   176.583066          0.339327
## 18   171.368191          0.338216
## 19   160.235693          0.328224
## 20   154.771687          0.338440
# We will add the optimal cluster ids to the main dataframe:

optimal_cluster_ids = cluster_ids_series[optimal_k_by_silhouette_score]
df['optimal_kmeans_cluster_ids'] = optimal_cluster_ids
# Again, for the sake of interest, let's create an elbow plot and a silhouette plot:

fignum = 20
plt.figure(fignum,figsize = (7,7))
## <Figure size 700x700 with 0 Axes>
plt.plot(report_df.index,report_df['SSE'],'b-')
## [<matplotlib.lines.Line2D object at 0x000000002A8C1860>]

#plt.gca().set_aspect('equal')
_ = plt.gca().set_xlim([0,20])
_ = plt.gca().set_xticks(range(21))
_ = plt.gca().set_ylim([0,1600])

plt.savefig('.\\img\\statistics3\\elbow_plot' + str(fignum) +'.png')
plt.show()

fignum = 21
plt.figure(fignum,figsize = (7,7))
## <Figure size 700x700 with 0 Axes>
plt.plot(report_df.index,report_df['Silhouette Score'],'ro')
## [<matplotlib.lines.Line2D object at 0x000000002A95FDA0>]
_ = plt.gca().set_xlim([0,20])
_ = plt.gca().set_xticks(range(21))
_ = plt.gca().set_ylim([-1,1])

plt.savefig('.\\img\\statistics3\\silhouette_plot' + str(fignum) +'.png')
plt.show()

Using the optimal k-means clustering, print summary statistics for each of the clusters (i.e. the mean, standard deviation, min and max of each variable).

# e.g. for kmeans:

# Create an empty list for the separate dataframes:

separate_cluster_dataframes = []

# Create a list of the cluster id numbers:

clusters = range(df['optimal_kmeans_cluster_ids'].max() + 1)

# For each cluster, take a reduced version of the main dataframe, filtered to contain only that cluster. 

for i in clusters:
    separate_cluster_dataframes.append(df[df['optimal_kmeans_cluster_ids'] == i])

# Create summary stats dataframes for each cluster:
    
separate_cluster_summary_stats = []

for i in clusters:
    separate_cluster_summary_stats.append(separate_cluster_dataframes[i].describe())
# e.g. for Cluster 0:

separate_cluster_summary_stats[0]
##         Time (mins)  ...  optimal_kmeans_cluster_ids
## count    147.000000  ...                       147.0
## mean   34318.428571  ...                         0.0
## std     2180.518714  ...                         0.0
## min    29107.000000  ...                         0.0
## 25%    32969.500000  ...                         0.0
## 50%    34137.000000  ...                         0.0
## 75%    35199.000000  ...                         0.0
## max    44193.000000  ...                         0.0
## 
## [8 rows x 8 columns]
plot_clusters_for_two_variables(clustering_data,df['optimal_kmeans_cluster_ids'],'Time (mins)_standard','Distance (m)_standard',100)

plot_clusters_for_two_variables(clustering_data,df['optimal_kmeans_cluster_ids'],'Time (mins)_standard','Size (Moment Magnitude)_standard',101)

plot_clusters_for_two_variables(clustering_data,df['optimal_kmeans_cluster_ids'],'Distance (m)_standard','Size (Moment Magnitude)_standard',102)

# A quick 3D visualisation of these clusters:
plot_3D_clusters(clustering_data,df['optimal_kmeans_cluster_ids'],'Time (mins)_standard','Distance (m)_standard','Size (Moment Magnitude)_standard',12121)